Estimating Jaccard Index with Missing Observations: A Matrix Calibration Approach
Abstract
The Jaccard index is a standard statistic for comparing the pairwise similarity between data samples. This paper investigates the problem of estimating a Jaccard index matrix when there are missing observations in the data samples. Starting from a Jaccard index matrix approximated from the incomplete data, our method calibrates the matrix to meet the requirement of positive semi-definiteness and other constraints through a simple alternating projection algorithm. Compared with conventional approaches that estimate the similarity matrix from imputed data, our method has a strong advantage: the calibrated matrix is guaranteed to be closer to the unknown ground truth in the Frobenius norm than the uncalibrated matrix (except in special cases where the two are identical). We carried out a series of empirical experiments, and the results confirmed our theoretical justification. The evaluation also showed significantly improved results in real learning tasks on benchmark datasets.
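The calibration idea can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract names only positive semi-definiteness and "other constraints", so the sketch assumes those other constraints are symmetry, a unit diagonal, and entries in [0, 1]. The function names (`jaccard_matrix`, `project_psd`, `project_box`, `calibrate`) and the synthetic noise that stands in for the effect of missing observations are hypothetical.

```python
import numpy as np

def jaccard_matrix(data):
    """Pairwise Jaccard index between the rows of a binary (n, d) array."""
    data = np.asarray(data, dtype=float)
    inter = data @ data.T                      # |A and B| for every pair
    counts = data.sum(axis=1)
    union = counts[:, None] + counts[None, :] - inter
    # Two empty rows are treated as identical (index 1) by convention here.
    return np.divide(inter, union, out=np.ones_like(inter), where=union > 0)

def project_psd(m):
    """Frobenius-norm projection onto the positive semi-definite cone."""
    sym = (m + m.T) / 2.0
    eigvals, eigvecs = np.linalg.eigh(sym)
    return (eigvecs * np.clip(eigvals, 0.0, None)) @ eigvecs.T

def project_box(m):
    """Projection onto the assumed set: entries in [0, 1], unit diagonal."""
    out = np.clip(m, 0.0, 1.0)
    np.fill_diagonal(out, 1.0)
    return out

def calibrate(j0, n_iter=200, tol=1e-8):
    """Alternate the two projections until the iterate stops moving."""
    x = (j0 + j0.T) / 2.0
    for _ in range(n_iter):
        prev = x
        x = project_box(project_psd(x))
        if np.linalg.norm(x - prev) < tol:
            break
    return x

# Synthetic illustration: symmetric noise stands in for the error that
# missing observations would introduce into the approximated matrix.
rng = np.random.default_rng(0)
samples = rng.integers(0, 2, size=(8, 20))
truth = jaccard_matrix(samples)
noise = rng.normal(scale=0.1, size=truth.shape)
noisy = truth + (noise + noise.T) / 2.0
calibrated = calibrate(noisy)

err_before = np.linalg.norm(noisy - truth)       # Frobenius norm by default
err_after = np.linalg.norm(calibrated - truth)
print(err_before, err_after)
```

Under these assumptions the true matrix lies in both convex sets, and a projection onto a convex set that contains the truth cannot move a point farther from it. The Frobenius error therefore does not increase at any step, which is the kind of guarantee the abstract describes.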
Similar resources
Preferred Robust Response Surface Design with Missing Observations Based on Integrated TOPSIS-AHP Method
Missing observations occur in experimental designs as a result of insufficient sampling, machine breakdown, high cost, and errors in the measurements. In nanomanufacturing, missing observations often appear in designs because the combination of factors or molecular structures selected by a designer cannot be experimented successfully. In the current paper, Box-Behnken and face-centered compos...
DEA with Missing Data: An Interval Data Assignment Approach
In the classical data envelopment analysis (DEA) models, inputs and outputs are assumed to be known variables, and these models cannot deal directly with unknown values. In recent years, there have been few studies on handling missing data. This paper suggests a new interval-based approach to handling missing data, which is a modified version of the Kuosmanen (2009) approach. First, the prop...
A new statistical approach for assessing similarity of species composition with incidence and abundance data
Anne Chao, Robin L. Chazdon, Robert K. Colwell and Tsung-Jen Shen. Institute of Statistics, National Tsing Hua University, Hsin-Chu, Taiwan; Department of Ecology and Evolutionary Biology, University of Connecticut, Storrs, CT, USA. Abstract: The classic Jaccard and Sørensen indices of compositional similarity (and other indices that depend upon the same v...
On the Normalization and Visualization of Author Co-Citation Data: Salton's Cosine versus the Jaccard Index
The debate about which similarity measure one should use for the normalization in the case of Author Co-citation Analysis (ACA) is further complicated when one distinguishes between the symmetrical co-citation—or, more generally, co-occurrence— matrix and the underlying asymmetrical citation—occurrence—matrix. In the Web environment, the approach of retrieving original citation data is often no...
b-Bit Minwise Hashing for Estimating Three-Way Similarities
Computing two-way and multi-way set similarities is a fundamental problem. This study focuses on estimating 3-way resemblance (Jaccard similarity) using b-bit minwise hashing. While traditional minwise hashing methods store each hashed value using 64 bits, b-bit minwise hashing only stores the lowest b bits (where b ≥ 2 for 3-way). The extension to 3-way similarity from the prior work on 2-way...